Overview

Modern data science is impossible without some understanding of the Unix command line. Unix is a family of operating systems that includes the Mac's OS X and Linux (technically, Linux is a Unix clone); Windows also has Unix emulators that allow you to run Unix commands. We will use the terms Unix and Linux interchangeably to mean operating systems that support the Unix shell commands, our present topic.

As your proficiency with the unix shell increases, so does your efficiency in completing and automating many tasks. This document is a tutorial on some of the basic unix command-line utilities used for gathering, searching, cleaning, and summarizing data. Unix commands are generally very efficient: they can process datasets too large to fit in your computer's main memory, handling workloads far beyond the capabilities of tools like Excel.

Once you have access to a terminal, try it out! Type pwd; this will print your current directory. To see the contents of that directory, type ls -A. We'll get into the real meat shortly, but first, let's grab some useful data to process.

Data

Many of the shell scripting examples below are performed on the following sample data (available for download; see below):

123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop

The columns in this tab-separated data correspond to [order id] [time of order] [user id] [ordered item], similar to what might be encountered in practice. If you wish, you can copy-paste the data above into a text editor, making sure the columns are separated by tabs and each line ends with a newline.

Alternatively, the sample data file is hosted online, and you can download it directly from the terminal with wget, like so:


In [1]:
!wget https://raw.githubusercontent.com/jattenberg/PDS-Spring-2014/master/data/sample.txt


--2014-05-21 20:33:55--  https://raw.githubusercontent.com/jattenberg/PDS-Spring-2014/master/data/sample.txt
Resolving raw.githubusercontent.com... 23.235.46.133
Connecting to raw.githubusercontent.com|23.235.46.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 201 [text/plain]
Saving to: ‘sample.txt’

100%[======================================>] 201         --.-K/s   in 0s      

2014-05-21 20:33:55 (12.0 MB/s) - ‘sample.txt’ saved [201/201]

This downloads the file into the current working directory, creating a new file called “sample.txt”. Note that on some systems wget may not be installed; in that case, you can try: curl https://raw.githubusercontent.com/jattenberg/PDS-Spring-2014/master/data/sample.txt -o sample.txt

Command-line Utilities

This section gives some crucial unix utilities, ordered roughly according to their usefulness to the data scientist. This list is by no means exhaustive, and the ordering is not perfect; different tasks have different demands. Fortunately, unix has been around for a while and has an extremely active user base, developing a wide range of utilities for common data processing, networking, system management, and automation tasks.

Once you are familiar with programming, you will be able to write your own scripts for tasks that existing unix utilities cannot accomplish. The tradeoff: hand-coded scripts offer more flexibility than existing utilities, but at the cost of increased development time, and therefore slower iteration.

We will talk about sending the output of one command to another below (“pipes”), but an important command-line operator is the “redirection” operator “>”. With “>” you can send the result of your command-line processing to a file. So if you’re using grep (described next) to find all the lines that contain “foo”, you can create a new file with just these lines using redirection:

grep 'foo' orig_file.txt > new_file.txt
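A related operator, ">>", appends to the end of a file rather than overwriting it. A minimal sketch, using the same hypothetical files:

grep 'foo' orig_file.txt >> new_file.txt   # append matching lines; existing contents of new_file.txt are kept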

Very handy. What follows is a list of some of the most useful linux utilities.

grep:

A utility for pattern matching. grep is by far the most useful unix utility. While grep is conceptually very simple, an effective developer or data scientist will no doubt find themselves using grep dozens of times a day. grep is typically called like this: grep [options] [pattern] [files]. With no options specified, this simply looks for the pattern in the given files, printing to the console only those lines that match. Example:


In [4]:
!grep 'foo bar' sample.txt


123	1346699925	11122	foo bar
146	1346699999	11123	foo bar

This in itself can be very useful for scanning large volumes of data to find exactly what you're looking for.

The power of grep really shows when different command options are specified. Below is just a sample of the more useful grep options:

  • -v: Inverted matching. In this setting, grep will return all the input lines that do not match the specified pattern. Example:

In [5]:
!grep -v 'foo bar' sample.txt


222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop
  • -R: Recursive matching. Here grep descends into subdirectories, applying the pattern to every file encountered. Very useful for checking whether any log files contain lines you're interested in, or for finding the source file that defines a particular function. Example:

In [1]:
!cd .. # move "up" one folder (in a notebook, each ! command runs in its own shell, so this cd does not persist)
!grep -R 'hee haw' . # here . refers to the current directory.
                     # includes matches from ipython nb files!


./.ipynb_checkpoints/Basic Unix Shell Commands for the Data Scientist-checkpoint.ipynb:      "140\t1346710000\t11122\thee haw\n",
./.ipynb_checkpoints/Basic Unix Shell Commands for the Data Scientist-checkpoint.ipynb:        "140\t1346710000\t11122\thee haw\r\n",
./Basic Unix Shell Commands for the Data Scientist.ipynb:      "140\t1346710000\t11122\thee haw\n",
./Basic Unix Shell Commands for the Data Scientist.ipynb:        "140\t1346710000\t11122\thee haw\r\n",
./sample.txt:140	1346710000	11122	hee haw
  • -P: Perl regular expressions. Here, patterns are interpreted as Perl-compatible regular expressions, giving the user the ability to match extremely flexible patterns. Example:

In [4]:
!grep -P '23\t+foo' sample.txt
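On the sample data, this pattern matches the "foo bar" line with user id 11123: the "23" at the end of the user id is followed by a tab and then "foo". As another sketch (note that -P is a GNU grep feature and may be unavailable on some systems, such as stock macOS grep):

grep -P '^\d{2}\t' sample.txt   # lines whose order id is exactly two digits (the two 99-id orders)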

sort

An extremely efficient implementation of external merge sort. In a nutshell, this means the sort utility can order a dataset far larger than can fit in a system’s main memory. While sorting extremely large files does drastically increase the runtime, smaller files are sorted quickly. Typically called like: sort [options] [file]. Example:


In [5]:
!sort sample.txt


123	1346699925	11122	foo bar
140	1346710000	11122	hee haw
146	1346699999	11123	foo bar
222	1346699955	11145	biz baz
234	1346700000	11135	bip bop
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop

Useful both as a component of larger shell scripts, and independently, as a tool to, say, quickly find the most active users, or to see the most frequently loaded pages on a domain. Some useful options:

  • -r: reverse order. Sort the input in descending order:

In [6]:
!sort -r sample.txt


99	1346750000	11135	bip bop
99	1346750000	11135	bip bop
234	1346700000	11135	bip bop
222	1346699955	11145	biz baz
146	1346699999	11123	foo bar
140	1346710000	11122	hee haw
123	1346699925	11122	foo bar
  • -n: numeric order. Sort the input in numerical order as opposed to the default lexicographical order:

In [7]:
!sort -n sample.txt


99	1346750000	11135	bip bop
99	1346750000	11135	bip bop
123	1346699925	11122	foo bar
140	1346710000	11122	hee haw
146	1346699999	11123	foo bar
222	1346699955	11145	biz baz
234	1346700000	11135	bip bop
  • -k n: sort the input according to the values in the n-th column. Useful for columnar data. See also the -t option, which specifies the field separator (an example with -t follows the output below):

In [8]:
!sort -k 2 sample.txt


123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
146	1346699999	11123	foo bar
234	1346700000	11135	bip bop
140	1346710000	11122	hee haw
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop
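When fields are separated by something other than whitespace, -t sets the separator. A minimal sketch, assuming a hypothetical comma-separated file data.csv:

sort -t',' -k2 -n data.csv   # sort the hypothetical CSV numerically by its second field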

uniq

Removes sequential duplicates: from each run of identical adjacent lines, only one is printed. Note that uniq compares only adjacent lines, so input is usually sorted first (as in the pipe examples below). Example:


In [9]:
!uniq sample.txt


123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar
99	1346750000	11135	bip bop

Used with the -c option, uniq prefixes each line with the number of times it occurs in sequence. Example:


In [10]:
!uniq -c sample.txt


   1 123	1346699925	11122	foo bar
   1 222	1346699955	11145	biz baz
   1 140	1346710000	11122	hee haw
   1 234	1346700000	11135	bip bop
   1 146	1346699999	11123	foo bar
   2 99	1346750000	11135	bip bop

cut:

Used to select or “cut” certain fields (usually columns) from input. Cut is typically used with the -f option to specify a comma-separated list of columns to be emitted. Example:


In [11]:
!cut -f2,4 sample.txt


1346699925	foo bar
1346699955	biz baz
1346710000	hee haw
1346700000	bip bop
1346699999	foo bar
1346750000	bip bop
1346750000	bip bop

An important option for the cut utility is -d, which specifies the character used to delimit fields in the input. While the default delimiter, tab, is appropriate for our sample file, if spaces were used instead of tabs, we could change the above command to: cut -d" " -f2,4 sample.txt

cat:

Concatenate the contents of the specified files to standard output. Example:


In [12]:
!cat sample.txt


123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop

ls:

Lists the contents of a directory or provides information about the specified files. Typical usage:

ls [options] [files or directories]

By default, ls simply lists the contents of the current directory. Several options, used in conjunction with ls, give more detailed information about the files or directories being queried. Here is a sample (a combined example follows the list):

  • -A: list all of the contents of the queried directory, even hidden files.
  • -l: detailed format, display additional info for all files and directories.
  • -R: recursively list the contents of any subdirectories.
  • -t: sort files by the time of the last modification.
  • -S: sort files by size.
  • -r: reverse any sort order.
  • -h: when used in conjunction with -l, gives a more human-readable output.
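For instance, a few of these flags combined (the output, of course, depends on your directory):

ls -lhtr   # long, human-readable listing, most recently modified last
ls -AlS    # detailed listing including hidden files, largest first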

cd:

Change the current directory. Usage: cd [directory to move to]
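A few common invocations:

cd /tmp   # move to an absolute path
cd ..     # move up one directory
cd -      # jump back to the previous directory
cd        # with no argument, return to your home directory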

head/tail:

Output the first (last) lines of a file. Typically used like:


In [15]:
!head -n 5 sample.txt


123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar

In [16]:
!tail -n 5 sample.txt


140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop

The -n option specifies the number of lines to be output; the default is 10. tail, when used with the -f option, will continue to output the end of a file as it is written to. This is useful if a program is writing output or logging progress to a file and you want to watch it as it happens.
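For example, to watch a hypothetical log file as a long-running job writes to it (press Ctrl-C to stop following):

tail -f job_output.log   # job_output.log is a hypothetical file name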

less:

Navigate through the contents of a file or the output of another script or utility. When invoked like less [some big file], less enters an interactive mode in which several keys help you navigate the input. Some key commands are:

  • (space): navigate forward one screen.
  • (enter): navigate forward one line.
  • b: navigate backwards one screen.
  • y: navigate backwards one line.
  • /[pattern]: search forwards for the next occurrence of [pattern].
  • ?[pattern]: search backwards for the previous occurrence of [pattern].

Where [pattern] can be a basic string or a regular expression.
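less also works at the end of a pipe (pipes are covered below), letting you page through the output of another command; press q to quit:

sort sample.txt | less   # page through the sorted sample file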

wc:

Compute word, line, and byte counts for specified files or output of other scripts. Particularly useful when used in concert with other utilities such as grep, sort, and uniq. Example usage:


In [17]:
!wc sample.txt


       7      35     201 sample.txt

These indicate the number of lines, words, and bytes in the file, respectively. There are some useful flags for wc that will help you answer specific questions quickly:

  • -l: get the number of lines from the input. Example:

In [18]:
!wc -l sample.txt


       7 sample.txt
  • -w: get the number of words in the input. Example:

In [19]:
!wc -w sample.txt


      35 sample.txt
  • -m: the number of characters in the input. Example:

In [21]:
!wc -m sample.txt


     201 sample.txt
  • -c: the number of bytes in the input. Example:

In [22]:
!wc -c sample.txt


     201 sample.txt

Here, the number of bytes and characters is the same: every character in the file is a single-byte (ASCII) character.

Pipes

Pipes provide a way of connecting the output of one unix program or utility to the input of another, through standard input and output. Unix pipes give you the power to compose various utilities into a data flow and use your creativity to solve problems. Utilities are connected together ("piped" together) via the pipe operator, |. For instance, if you want to know how many records in the sample data file do not contain "foo bar", you can compose a data flow like this:


In [24]:
!cat sample.txt | grep -v 'foo bar' | wc -l


       5

Using wc at the end of a pipe to count the number of matching output records is a common pattern. Recalling that uniq removes any sequential duplicates, we can count the number of unique users making purchases in our file by composing a data flow like this:


In [25]:
!cat sample.txt | cut -f3 | sort | uniq  | wc -l


       4

Or, if you want to count how many transactions each user appears in:


In [26]:
!cat sample.txt | cut -f3 | sort | uniq -c


   2 11122
   1 11123
   3 11135
   1 11145

To now order the users by number of transactions made, you can try something like:


In [27]:
!cat sample.txt | cut -f3 | sort | uniq -c | sort -nr


   3 11135
   2 11122
   1 11145
   1 11123

Notice here that the -r and -n flags for the sort command are combined into -nr. This is common shorthand and works with most unix utilities that take single-letter flags.
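Combining pipes with the redirection operator from earlier, the result of a data flow can be saved to a file (user_counts.txt is an arbitrary name):

cut -f3 sample.txt | sort | uniq -c | sort -nr > user_counts.txt   # per-user transaction counts, most active first

Note that cut can read the file directly, so the leading cat in the examples above is not strictly necessary.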

More Useful Command Line Utilities:

  • xargs: used for building and executing terminal commands. It often reads input from a pipe and performs the same command on each line read. For instance, if we want to find all of the .txt files in a directory and concatenate them, we can use xargs:

In [29]:
!ls . | grep '.txt' | xargs cat


123	1346699925	11122	foo bar
222	1346699955	11145	biz baz
140	1346710000	11122	hee haw
234	1346700000	11135	bip bop
146	1346699999	11123	foo bar
99	1346750000	11135	bip bop
99	1346750000	11135	bip bop
  • find: search directories for matching files. Useful when you know the name of a file (or part of the name), but do not know the file's location in the directory tree. Example:

In [34]:
!find . -name 'sample.txt'


./sample.txt
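find composes naturally with xargs; for instance, to count the lines in every .txt file under the current directory:

find . -name '*.txt' | xargs wc -l   # for file names containing spaces, use: find . -name '*.txt' -print0 | xargs -0 wc -l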
  • sed: a feature-rich stream editor. Useful for performing simple transformations on an input stream, whether from a pipe or from a file. For instance, if we want to replace the space in the fourth column of our sample input with an underscore, we can use sed:

In [35]:
!cat sample.txt | sed 's/ /_/'


123	1346699925	11122	foo_bar
222	1346699955	11145	biz_baz
140	1346710000	11122	hee_haw
234	1346700000	11135	bip_bop
146	1346699999	11123	foo_bar
99	1346750000	11135	bip_bop
99	1346750000	11135	bip_bop
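Note that s/ /_/ replaces only the first space on each line, which suffices here because each line contains a single space. To replace every occurrence, add the g flag:

cat sample.txt | sed 's/ /_/g'   # replace all spaces on each line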
  • screen: Manager for terminal screens. Can be used to “re-attach” terminal sessions so you can continue your work after logging out, etc. Particularly useful when working on a remote server.
  • top: displays currently running tasks and their resource utilization.
  • fmt: a simple text formatter, often used for limiting the width of lines in a file. Typical usage includes a -width flag, where width is a positive integer giving the maximum width of each output line in characters; passing -1 therefore forces one word per line, where words are sequences of non-whitespace characters. For instance, if we want all the individual "words" of our sample input file, one per line, we can use (with head to limit the output):

In [36]:
!fmt -1 sample.txt | head


123
1346699925
11122
foo
bar
222
1346699955
11145
biz
baz
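This one-word-per-line output composes nicely with the pipeline patterns above; for instance, to count how often each word appears in the sample file:

fmt -1 sample.txt | sort | uniq -c | sort -nr   # word frequencies, most common first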

Pick your Text Editor

A rich set of editors is available in the terminal, useful both for exploring and modifying files and for writing source code. nano is the simplest common text editor; vim and emacs are both far more complex and far more feature-rich. Choosing vim or emacs entails climbing a learning curve: there are many special key combinations that do useful things, and special modes optimized for common tasks. Once mastered, however, this power can make you a much more effective programmer, greatly reducing your time between iterations.

For experienced programmers, choosing an editor is almost like choosing a religion: one is right and all others are wrong. Some programmers are very vocal about this. However, in the end of the day, all editors do the same things, albeit offering different paths to get there. When you feel you’re ready to try out a new text editor, my advice is pick one that your friends or colleagues are familiar with. They can get you on your feet quickly with a few useful tips, and get you unstuck when you run into trouble